In the rapidly evolving landscape of financial technology, credit scoring remains a cornerstone in determining the creditworthiness of applicants. This paper delves into the realm of predictive modeling, using a dataset provided by our professors, enriched with comprehensive credit scoring information. The overarching objective of our study is to train a model that efficiently and accurately determines the eligibility of individuals for credit approval.

The task set forth by our professors involves a comprehensive journey through four key exercises. The first exercise is centered around a thorough analysis of the dataset. Here, we delve into understanding the underlying structures, identifying critical variables, and most importantly, balancing the dataset to ensure a fair representation of diverse credit scenarios.

Subsequently, in the second exercise, we focus on training and testing a logistic classifier. This step is crucial in establishing a baseline model from which further improvements can be measured.

The third exercise involves enhancing the predictive performance of the model. This stage is particularly challenging and significant, as it entails fine-tuning the model to ensure it captures the nuances of credit scoring with greater accuracy and efficiency.

Finally, the fourth exercise invites us to step into the practical world. Here, we explore the various challenges a company may face if they were to implement our model in a real-world scenario. This exercise not only grounds our theoretical work in reality but also provides valuable insights into the practical implications and considerations in deploying machine learning models in the credit industry.

This paper aims to not only present a robust model for credit scoring but also to contribute to the broader understanding of how machine learning can be effectively utilized in financial decision-making processes.

Load packages and data

Exercise 1

1.1 Describe the data. Specifically:

1.1.1: Check and report the structure of the data set.

str(loan_sample)
## spc_tbl_ [40,000 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ loan_amnt          : num [1:40000] 10000 16000 6000 9000 24000 8000 3600 5550 2000 10000 ...
##  $ int_rate           : num [1:40000] 11.47 9.49 17.99 9.75 12.59 ...
##  $ grade              : chr [1:40000] "B" "B" "D" "B" ...
##  $ home_ownership     : chr [1:40000] "RENT" "MORTGAGE" "RENT" "MORTGAGE" ...
##  $ annual_inc         : num [1:40000] 35000 110000 40000 54000 66000 ...
##  $ verification_status: chr [1:40000] "Verified" "Not Verified" "Verified" "Verified" ...
##  $ purpose            : chr [1:40000] "debt_consolidation" "debt_consolidation" "home_improvement" "car" ...
##  $ dti                : num [1:40000] 22.05 15.67 17.34 3.58 12.25 ...
##  $ open_acc           : num [1:40000] 15 9 8 11 8 10 8 5 8 14 ...
##  $ revol_bal          : num [1:40000] 10211 10068 3755 1459 29656 ...
##  $ revol_util         : num [1:40000] 31.5 32.7 25.5 22.8 70.6 34.3 36.1 66.1 24.6 75.6 ...
##  $ total_acc          : num [1:40000] 17 21 19 34 21 17 9 12 30 21 ...
##  $ total_rec_int      : num [1:40000] 1574 2364 943 436 2778 ...
##  $ application_type   : chr [1:40000] "Individual" "Individual" "Individual" "Joint App" ...
##  $ tot_cur_bal        : num [1:40000] 17440 35104 19995 8713 29656 ...
##  $ total_rev_hi_lim   : num [1:40000] 32400 30800 14700 6400 42000 28200 21400 6200 17500 30300 ...
##  $ Status             : num [1:40000] 0 0 0 0 0 0 1 0 0 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   loan_amnt = col_double(),
##   ..   int_rate = col_double(),
##   ..   grade = col_character(),
##   ..   home_ownership = col_character(),
##   ..   annual_inc = col_double(),
##   ..   verification_status = col_character(),
##   ..   purpose = col_character(),
##   ..   dti = col_double(),
##   ..   open_acc = col_double(),
##   ..   revol_bal = col_double(),
##   ..   revol_util = col_double(),
##   ..   total_acc = col_double(),
##   ..   total_rec_int = col_double(),
##   ..   application_type = col_character(),
##   ..   tot_cur_bal = col_double(),
##   ..   total_rev_hi_lim = col_double(),
##   ..   Status = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
# Checking for NAs
any(is.na(loan_sample))
## [1] FALSE

The result of “FALSE” from the check indicates that there are no missing values (NAs) in the “loan_sample” dataset.

1.1.2: How many numeric and how many categorical variables are included in the data? What categorical variable has the most levels in it?

# First we have to check whether all variables have been imported in the correct format. 
first_overview <- overview(loan_sample)
plot(first_overview)

It is clearly visible that the columns with the data type character still have to be converted to factors. In addition, the “Status” column is currently still a numeric data type. This must also be converted to factor, as it is a binary variable.

loan_sample <- loan_sample %>%
  mutate_if(is.character, as.factor) %>% # Convert all character columns to factors
  mutate(Status = as.factor(Status)) # Convert 'Status' column to factor

# Plot to see, if the datatypes are correct
second_overview <- overview(loan_sample)
plot(second_overview)

That looks right, now we can count the number of variables.

# Count numeric variables in 'loan_sample'
numeric_vars_count <- sum(sapply(loan_sample, is.numeric))

# Count categorical (factor) variables in 'loan_sample'
categorical_vars_count <- sum(sapply(loan_sample, is.factor))

# Print counts of numeric and categorical variables
cat("There are", numeric_vars_count, "numeric variables and", categorical_vars_count, "categorical variables in the dataset.")
## There are 11 numeric variables and 6 categorical variables in the dataset.
categorical_levels <- sapply(loan_sample, function(x) if(is.factor(x)) length(unique(x)) else NA)

# Find the name of the categorical variable with the most levels
cat_var_most_levels <- names(which.max(categorical_levels))

# Find the number of levels for this variable
levels_count <- max(categorical_levels, na.rm = TRUE)

# Print the variable name and the number of levels
cat("The categorical variable with the most levels in the dataset is:", cat_var_most_levels, "with", levels_count, "levels.")
## The categorical variable with the most levels in the dataset is: purpose with 13 levels.

1.1.3: Summarize the variables. Discuss the summary statistics obtained.

# Summary statistics for all variables
summary(loan_sample)
##    loan_amnt        int_rate     grade      home_ownership    annual_inc    
##  Min.   : 1000   Min.   : 5.31   A: 7329   MORTGAGE:17701   Min.   :  5000  
##  1st Qu.: 7000   1st Qu.: 9.44   B:13166   OWN     : 3981   1st Qu.: 42000  
##  Median :10075   Median :12.29   C:11842   RENT    :18318   Median : 57000  
##  Mean   :11687   Mean   :12.62   D: 7663                    Mean   : 63400  
##  3rd Qu.:15000   3rd Qu.:15.05                              3rd Qu.: 77000  
##  Max.   :40000   Max.   :27.49                              Max.   :400000  
##                                                                             
##       verification_status               purpose           dti       
##  Not Verified   :14373    debt_consolidation:23342   Min.   : 0.00  
##  Source Verified:16116    credit_card       : 9362   1st Qu.:12.13  
##  Verified       : 9511    other             : 2337   Median :17.60  
##                           home_improvement  : 2099   Mean   :18.22  
##                           major_purchase    :  794   3rd Qu.:23.86  
##                           medical           :  444   Max.   :60.14  
##                           (Other)           : 1622                  
##     open_acc       revol_bal       revol_util       total_acc    
##  Min.   : 1.00   Min.   :    0   Min.   :  0.00   Min.   : 3.00  
##  1st Qu.: 8.00   1st Qu.: 5615   1st Qu.: 34.80   1st Qu.:15.00  
##  Median :10.00   Median : 9818   Median : 52.40   Median :20.00  
##  Mean   :10.31   Mean   :11995   Mean   : 52.17   Mean   :21.27  
##  3rd Qu.:13.00   3rd Qu.:15832   3rd Qu.: 70.00   3rd Qu.:27.00  
##  Max.   :23.00   Max.   :78762   Max.   :121.40   Max.   :57.00  
##                                                                  
##  total_rec_int      application_type  tot_cur_bal     total_rev_hi_lim
##  Min.   :   0.0   Individual:39450   Min.   :     0   Min.   :   300  
##  1st Qu.: 673.2   Joint App :  550   1st Qu.: 25136   1st Qu.: 13000  
##  Median :1342.5                      Median : 53722   Median : 20900  
##  Mean   :1818.5                      Mean   : 99100   Mean   : 24194  
##  3rd Qu.:2432.9                      3rd Qu.:157834   3rd Qu.: 32200  
##  Max.   :8834.9                      Max.   :472573   Max.   :100000  
##                                                                       
##  Status   
##  0:34810  
##  1: 5190  
##           
##           
##           
##           
## 

Our data set covers 40,000 loans. The average loan amount is about $11,687, ranging from $1,000 to $40,000, so loan sizes vary considerably across borrowers. Interest rates average 12.62% but span 5.31% to 27.49%, reflecting substantial differences in borrower risk. Borrowers earn $63,400 per year on average, with a wide spread (up to $400,000), and their average debt-to-income ratio (DTI) is 18.22%. Borrowers hold 10 open credit lines on average, out of roughly 21 accounts held over their credit history. The average revolving balance is $11,995, with about half (52%) of the available revolving credit utilized, and the average interest paid to date on a loan is about $1,818. Across all credit products, borrowers owe $99,100 on average, while their average total revolving credit limit is $24,194, i.e. the total amount lenders allow them to borrow on revolving accounts.

1.1.4: Check the levels of the target variable by choosing the appropriate visualization. Is the target variable balanced?

# Plot the distribution of the target variable
ggplot(loan_sample, aes(x = Status, fill = Status)) + 
  geom_bar() +
  scale_fill_brewer(palette = "Set1") + 
  labs(title = 'Distribution of target variable (Status)', x = 'Status', y = 'Count') +
  theme_minimal()

The target variable ‘Status’ in our dataset has two levels: ‘0’ indicates that the loan was repaid (non-default), whereas ‘1’ indicates a default.

The target variable is heavily imbalanced, with far more non-defaults (34,810) than defaults (5,190). This imbalance is important to address because, in machine learning, an unbalanced dataset can lead to biased models that over-predict the majority class, in this case the non-defaults. Consequently, balancing the dataset is crucial for training an accurate and reliable logistic classifier that fairly represents both outcomes.

1.1.5: Check the distribution of the numeric variables in the data set (include different visual representations).

# Histograms
loan_sample %>% 
  select_if(is.numeric) %>% 
  gather(key = "variable", value = "value") %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~variable, scales = 'free_x') +
  theme_minimal()

The histogram we created for the variable open_acc has gaps. We should examine this more closely.

# Histogram of open_acc with the number of bins matched to its value range
ggplot(loan_sample, aes(x = open_acc)) +
  geom_histogram(bins = 22) + # Adjust the number of bins as necessary
  labs(x = "Open Accounts", y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Since the variable only takes integer values between 1 and 23, using 30 bins produces empty bins. Matching the number of bins to the actual value range yields a clearer, gap-free histogram.

# Boxplots
loan_sample %>% 
  select_if(is.numeric) %>% 
  gather(key = "variable", value = "value") %>%
  ggplot(aes(y = value)) +
  geom_boxplot() +
  facet_wrap(~variable, scales = 'free') +
  theme_minimal()

1.2: Investigating outliers with a box plot

Elaborate your view on how to proceed in dealing with the outliers and – if necessary – take appropriate action.

# Reshape the data to long format and scale the numeric values
loan_sample_long_scaled <- loan_sample %>%
  select(where(is.numeric)) %>%
  mutate(across(everything(), scale)) %>%
  pivot_longer(cols = everything(), names_to = "variable", values_to = "value")

# Create a boxplot with the scaled values
ggplot(loan_sample_long_scaled, aes(x = variable, y = value)) +
  geom_boxplot() +
  labs(x = "", y = "Scaled Value") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

The box plot illustrates the distribution of the scaled numeric variables. Outliers appear as individual points beyond the whiskers, and most of the variables have some.

When considering how to deal with outliers, it is important to understand their nature and their potential impact on the analysis. If outliers represent errors or anomalies that are not characteristic of the population, they can be removed or corrected. Here we instead cap extreme values at the 1st and 99th percentiles (winsorizing), which keeps all observations while limiting the influence of extreme values.

# Diagnosing outliers and selecting variables with more than 1% outliers
variables_with_high_outlier_ratio <- loan_sample %>%
  diagnose_outlier() %>%
  filter(outliers_ratio > 1) %>%
  pull(variables)

# Plotting the outliers for these variables
loan_sample %>%
  plot_outlier(one_of(variables_with_high_outlier_ratio))

# Function for adjusting the values to the quantile limits
cap_values_at_quantiles <- function(x) {
    quantiles <- quantile(x, c(0.01, 0.99), na.rm = TRUE)
    x <- ifelse(x < quantiles[1], quantiles[1], x)
    x <- ifelse(x > quantiles[2], quantiles[2], x)
    x
}

loan_sample_truncated <- loan_sample %>%
  mutate(across(where(is.numeric), cap_values_at_quantiles))
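As a quick sanity check (a sketch assuming the objects created above), we can confirm that every numeric column of `loan_sample_truncated` now lies within the 1%/99% quantile bounds computed on the original data:

```r
# Sanity check (sketch): every truncated value lies within the original
# 1%/99% quantile bounds of its column
numeric_cols <- names(loan_sample)[sapply(loan_sample, is.numeric)]
for (col in numeric_cols) {
  q <- quantile(loan_sample[[col]], c(0.01, 0.99), na.rm = TRUE)
  stopifnot(all(loan_sample_truncated[[col]] >= q[1]),
            all(loan_sample_truncated[[col]] <= q[2]))
}
```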

1.3: Create a new balanced data set where the two levels of the target variable will be equally represented; Create a bar plot of the newly created target variable. Why is this step necessary?

loan_sample_balanced <- ovun.sample(Status ~ ., data=loan_sample_truncated, method = "under")
loan_sample_under <- data.frame(loan_sample_balanced[["data"]])

# Plot the balance
ggplot(loan_sample_under, aes(x = Status, fill = Status, group = Status)) +
  geom_bar() +
  ylab("Count") +
  xlab("Status of the loan") +
  scale_fill_brewer(palette = "Set1") +
  theme_minimal() +
  ggtitle("New balanced dataset")

Balancing a dataset is important for training classification models because it prevents the model from being biased towards the majority class and potentially ignoring the minority class. This is particularly true for datasets where the outcome class of interest (e.g., default on a loan) is much less common than the non-interest class (e.g., non-default). Balancing helps to improve the generalization of the model and its performance on unseen data, especially for the minority class.
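The undersampling above discards majority-class rows. As a hedged alternative sketch (assuming the ROSE package that provides `ovun.sample`), the minority class could instead be oversampled, which keeps all majority observations at the cost of duplicating defaults:

```r
# Alternative (sketch): oversample the minority class instead of undersampling
loan_sample_over <- ovun.sample(Status ~ ., data = loan_sample_truncated,
                                method = "over", seed = 1)$data
table(loan_sample_over$Status)  # both classes now roughly equal in size
```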

1.4: Investigate the distribution of the numeric features per the two levels of our target feature (i.e. default vs non-default). Discuss the visualizations. Which variables seem to be relevant in predicting the target feature?

# Define columns to exclude (these are the factor columns)
exclude_columns <- c(3, 4, 6, 7, 14)

# Loop through columns, excluding the specified ones
for (i in setdiff(1:ncol(loan_sample_under), exclude_columns)) {
  # Only plot numeric columns
  if (is.numeric(loan_sample_under[[i]])) {
    var_name <- names(loan_sample_under)[i]
    print(ggplot(loan_sample_under, aes(y = .data[[var_name]], color = Status)) +
            geom_boxplot() +
            ylab(var_name) +
            theme(axis.title.x = element_blank(),
                  axis.text.x = element_blank(),
                  axis.ticks.x = element_blank()))
  }
}


1.5: Boruta algorithm

In the next step, we run the Boruta algorithm. Boruta provides a more objective method for determining the importance of features, as it is based on repeated comparisons against randomized shadow features rather than human judgment.
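The call that produced the trace below does not appear in the source; a minimal sketch, assuming the Boruta package and the balanced data set `loan_sample_under`, would be:

```r
library(Boruta)

set.seed(1)  # assumption: a fixed seed, the original seed is unknown
boruta_res <- Boruta(Status ~ ., data = loan_sample_under, doTrace = 1)
# Attributes confirmed as important (tentative attributes excluded)
getSelectedAttributes(boruta_res, withTentative = FALSE)
```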

##  1. run of importance source...
##  2. run of importance source...
##  [... runs 3-10 omitted ...]
##  11. run of importance source...
## After 11 iterations, +23 secs:
##  confirmed 14 attributes: annual_inc, dti, grade, home_ownership, int_rate and 9 more;
##  rejected 1 attribute: application_type;
##  still have 1 attribute left.
##  [... runs 12-99 omitted ...]
##  [1] "loan_amnt"           "int_rate"            "grade"              
##  [4] "home_ownership"      "annual_inc"          "verification_status"
##  [7] "purpose"             "dti"                 "open_acc"           
## [10] "revol_bal"           "revol_util"          "total_acc"          
## [13] "total_rec_int"       "tot_cur_bal"         "total_rev_hi_lim"

The Boruta run confirms 14 of the 16 predictors and rejects only application_type. Next, we can examine multicollinearity, which occurs when one predictor variable in a multiple regression model can be linearly predicted from the others with a high degree of accuracy. This situation can result in skewed or misleading coefficient estimates and conclusions.
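A common numeric diagnostic for multicollinearity is the variance inflation factor (VIF). As a sketch, assuming the car package and an auxiliary logistic fit on the balanced data:

```r
library(car)

# Auxiliary logistic fit of Status on all predictors, used only for the VIF check
aux_fit <- glm(Status ~ ., data = loan_sample_under, family = binomial())
vif(aux_fit)  # GVIF values well above ~5 would indicate problematic collinearity
```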

1.6: Visualize the correlations that emerge between the numerical features. Discuss the results. Which variables are highly correlated? Decide whether you keep all variables.

correlations = cor(loan_sample_under[-c(3,4,6,7,14,17)])
corrplot(correlations) 

We can also test the significance of the correlations. The cor_pmat function computes a matrix of p-values, which ggcorrplot then uses to explicitly mark the insignificant correlations.

p_value_mat <- cor_pmat(loan_sample_under[,-c(3,4,6,7,14,17)])
ggcorrplot(correlations, type = "lower", p.mat = p_value_mat) 

Strong correlations (absolute values close to 1) can indicate a redundant or derived relationship in which one variable is largely predictable from another, while values closer to 0 indicate a weaker relationship. Based on the correlation plots and the output of the Boruta algorithm, no variable is so strongly correlated with another that it needs to be dropped, so we keep all variables.

1.7: Use a bar plot visualization to investigate the associations between the categorical variables and the target feature.

# Generate a bar plot for each categorical variable
categorical_vars <- c("home_ownership", "verification_status", "purpose", "addr_state", "application_type")
# Loop over categorical variables and plot using aes() and without aes_string()
for (cat_var in categorical_vars) {
  # Check if the column exists to avoid errors
  if (!cat_var %in% names(loan_sample_under)) {
    message(paste("Skipping", cat_var, "as it is not found in the dataset."))
    next
  }
  
  # Create the plot
  plot <- loan_sample_under %>%
    group_by(.data[[cat_var]], Status) %>%
    summarise(Count = n(), .groups = 'drop') %>%
    ggplot(aes(x = .data[[cat_var]], y = Count, fill = as.factor(Status))) +
    geom_bar(stat = "identity", position = position_dodge()) +
    labs(title = paste("Bar plot of", cat_var, "by Status"), x = cat_var, y = "Count") +
    scale_fill_discrete(name = "Status") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) # Rotate x labels for readability
   # Print the plot
  print(plot)
}

## Skipping addr_state as it is not found in the dataset.


1.8: Plot an interactive scatter plot of the association between the loan amount requested and the annual income of the borrower. Discuss the plot. What can you tell about the association?

library(plotly)

# The relevant columns in loan_sample_under are loan_amnt and annual_inc
plot_ly(data = loan_sample_under, x = ~loan_amnt, y = ~annual_inc, 
        type = 'scatter', mode = 'markers', 
        marker = list(size = 10, opacity = 0.5)) %>%
  layout(title = 'Association between Loan Amount and Annual Income',
         xaxis = list(title = 'Loan Amount Requested'),
         yaxis = list(title = 'Annual Income of Borrower'))

The areas of high density indicate common combinations of income and loan amounts, which could suggest standard loan products or typical borrower profiles. A wide range of incomes with a relatively narrow range of loan amounts might suggest that the loan amount is less sensitive to income past a certain threshold.
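The visual association can also be quantified with the Pearson correlation between the two variables (a small sketch using the objects above):

```r
# Pearson correlation between loan amount and annual income, with significance test
cor.test(loan_sample_under$loan_amnt, loan_sample_under$annual_inc)
```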

Exercise 2 Train and test a logistic classifier.

2.1: Dividing the sample into training and testing set

# Split the data into training (70%) and testing (30%) sets
splitIndex <- createDataPartition(loan_sample_under$Status, p = 0.7, list = FALSE)
training_set <- loan_sample_under[splitIndex,]
testing_set <- loan_sample_under[-splitIndex,]

We verify that the 50/50 class balance is preserved in both the training and the testing set:

PercTable(loan_sample_under$Status)
##                
##     freq   perc
##                
## 0  5'191  50.0%
## 1  5'190  50.0%
PercTable(training_set$Status)
##                
##     freq   perc
##                
## 0  3'634  50.0%
## 1  3'633  50.0%
PercTable(testing_set$Status)
##                
##     freq   perc
##                
## 0  1'557  50.0%
## 1  1'557  50.0%

2.2: Train the classifier and report the coefficients obtained and interpret the results.

In the next step, we train the logit model. As inputs, i.e. our Xs, we use all variables included in the training set apart from Status, which is our Y. The coefficients reported by summary(fit1) are shown below.
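The fitting call itself is not shown in the source; the Call line in the output below indicates it was:

```r
# Logistic regression of Status on all remaining predictors
fit1 <- glm(Status ~ ., family = binomial(), data = training_set)
summary(fit1)
```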

## 
## Call:
## glm(formula = Status ~ ., family = binomial(), data = training_set)
## 
## Coefficients:
##                                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                        -2.682e+00  3.262e-01  -8.222  < 2e-16 ***
## loan_amnt                           5.467e-05  5.999e-06   9.114  < 2e-16 ***
## int_rate                            1.149e-01  1.535e-02   7.485 7.15e-14 ***
## gradeB                              3.516e-01  1.042e-01   3.373 0.000743 ***
## gradeC                              5.109e-01  1.360e-01   3.757 0.000172 ***
## gradeD                              5.576e-01  2.012e-01   2.772 0.005574 ** 
## home_ownershipOWN                  -1.002e-01  9.276e-02  -1.080 0.280089    
## home_ownershipRENT                  2.076e-01  6.804e-02   3.051 0.002277 ** 
## annual_inc                         -4.556e-06  1.192e-06  -3.823 0.000132 ***
## verification_statusSource Verified  1.225e-01  5.958e-02   2.056 0.039766 *  
## verification_statusVerified         9.698e-02  6.820e-02   1.422 0.155005    
## purposecredit_card                  1.181e-01  2.515e-01   0.469 0.638794    
## purposedebt_consolidation           1.752e-01  2.477e-01   0.707 0.479275    
## purposehome_improvement             3.032e-01  2.698e-01   1.123 0.261247    
## purposehouse                       -3.351e-01  4.328e-01  -0.774 0.438712    
## purposemajor_purchase              -1.399e-02  3.030e-01  -0.046 0.963176    
## purposemedical                      2.914e-01  3.397e-01   0.858 0.390989    
## purposemoving                      -5.641e-01  3.843e-01  -1.468 0.142168    
## purposeother                       -5.626e-02  2.646e-01  -0.213 0.831613    
## purposerenewable_energy             1.316e-01  8.686e-01   0.152 0.879572    
## purposesmall_business               2.460e-01  3.475e-01   0.708 0.478973    
## purposevacation                     1.913e-01  3.477e-01   0.550 0.582132    
## purposewedding                     -9.467e-01  6.998e-01  -1.353 0.176128    
## dti                                 1.307e-02  3.660e-03   3.569 0.000358 ***
## open_acc                            5.531e-02  9.498e-03   5.824 5.74e-09 ***
## revol_bal                          -6.262e-06  7.839e-06  -0.799 0.424403    
## revol_util                          2.140e-03  1.985e-03   1.078 0.281017    
## total_acc                          -1.095e-02  3.928e-03  -2.788 0.005310 ** 
## total_rec_int                      -2.394e-04  2.279e-05 -10.509  < 2e-16 ***
## application_typeJoint App           4.315e-01  2.128e-01   2.028 0.042558 *  
## tot_cur_bal                        -7.646e-07  3.818e-07  -2.003 0.045213 *  
## total_rev_hi_lim                   -3.826e-06  4.463e-06  -0.857 0.391348    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10074.2  on 7266  degrees of freedom
## Residual deviance:  9188.7  on 7235  degrees of freedom
## AIC: 9252.7
## 
## Number of Fisher Scoring iterations: 4

We can print out only the coefficients with a p-value below 0.05. We notice that 14 coefficients are statistically significant.
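A sketch of how such a list can be extracted from the fitted model (the exact code used is not shown in the source):

```r
# Names of coefficients with p-value below 0.05, intercept excluded
coefs <- summary(fit1)$coefficients
sig_vars <- rownames(coefs)[coefs[, "Pr(>|z|)"] < 0.05]
sig_vars[sig_vars != "(Intercept)"]
```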

##  [1] "loan_amnt"                          "int_rate"                          
##  [3] "gradeB"                             "gradeC"                            
##  [5] "gradeD"                             "home_ownershipRENT"                
##  [7] "annual_inc"                         "verification_statusSource Verified"
##  [9] "dti"                                "open_acc"                          
## [11] "total_acc"                          "total_rec_int"                     
## [13] "application_typeJoint App"          "tot_cur_bal"

2.3: Plot the ROC and the Precision/Recall Curve and interpret the results.

Next, we aim to evaluate the predictive performance of our model. To do so, we will plot the ROC curve.

testing_set$fit1_score <- predict(fit1, newdata = testing_set, type = 'response')
fit1_pred <- prediction(testing_set$fit1_score, testing_set$Status)
fit1_roc <- performance(fit1_pred, "tpr", "fpr")
plot(fit1_roc, lwd=1, colorize = TRUE, main = "Fit1: Logit - ROC Curve")
lines(x=c(0, 1), y=c(0, 1), col="black", lwd=1, lty=3)

In the next step, we visualize the Precision/Recall Curve. This curve summarizes the trade-off between the true positive rate and the positive predictive value for a predictive model using different probability thresholds.

fit1_precision <- performance(fit1_pred, measure = "prec", x.measure = "rec")
plot(fit1_precision, main="Fit1: Logit - Precision vs Recall")

The visible jump at the left end of the Precision/Recall curve reflects irregularities in the distribution of the predicted probabilities: a small set of observations classified as positive with very high confidence, followed by a larger group with lower confidence, can produce such a jump.

2.4: Produce the confusion matrix and interpret the results.

# Extract the confusion matrix 
cm <- confusionMatrix(as.factor(round(testing_set$fit1_score)), testing_set$Status)
print(cm)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0  984  550
##          1  573 1007
##                                           
##                Accuracy : 0.6394          
##                  95% CI : (0.6222, 0.6563)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.2787          
##                                           
##  Mcnemar's Test P-Value : 0.5115          
##                                           
##             Sensitivity : 0.6320          
##             Specificity : 0.6468          
##          Pos Pred Value : 0.6415          
##          Neg Pred Value : 0.6373          
##              Prevalence : 0.5000          
##          Detection Rate : 0.3160          
##    Detection Prevalence : 0.4926          
##       Balanced Accuracy : 0.6394          
##                                           
##        'Positive' Class : 0               
## 
With ‘0’ (non-default) as the positive class, the model correctly classifies 984 of 1,557 non-defaults (sensitivity 63.2%) and 1,007 of 1,557 defaults (specificity 64.7%), giving an overall accuracy of 63.9% against a no-information rate of 50% on the balanced test set. Misclassifications are roughly symmetric across the two classes (573 non-defaults flagged as defaults vs. 550 defaults missed).

2.5 Report the AUC values and the overall accuracy and interpret the results.

# AUC
fit1_auc <- performance(fit1_pred, measure = "auc")

# Overall accuracy
accuracy <- sum(diag(cm$table)) / sum(cm$table)

# Print of the values
cat("AUC: ", fit1_auc@y.values[[1]]*100, "\nOverall Accuracy: ", accuracy)
## AUC:  69.51167 
## Overall Accuracy:  0.6393706

An AUC of about 0.695 means that the model ranks a randomly chosen positive case above a randomly chosen negative case roughly 70% of the time. Together with the overall accuracy of about 64% on a balanced test set, the classifier is clearly better than chance (AUC 0.5, accuracy 50%) but leaves considerable room for improvement.

Exercise 3

Thinking about the pre-processing steps that were carried out before training the logistic classifier:

3.1: Can you think of a way to improve the predictive performance of your data?

3.2: What can you do differently? (hint: Feel free to be creative and discuss any additional step in data collection and/or data pre-processing that you might try so to improve the results)
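One concrete option, sketched here under the assumption that the caret package is available, is to replace the single train/test split with k-fold cross-validation, which gives a more stable estimate of out-of-sample performance and can be combined with preprocessing steps such as centring and scaling:

```r
library(caret)

# Sketch: 10-fold cross-validated logistic regression on the balanced data,
# with predictors centred and scaled inside each fold
ctrl <- trainControl(method = "cv", number = 10)
cv_fit <- train(Status ~ ., data = loan_sample_under,
                method = "glm",  # caret fits a binomial glm for a factor outcome
                preProcess = c("center", "scale"), trControl = ctrl)
cv_fit$results  # cross-validated Accuracy and Kappa
```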

Exercise 4

Finally, thinking about putting the model into action and basing credit decisions on the predictions that it generates:

• What kind of challenges may a company face if it were to use this model in its daily business, in particular with regard to the ethical challenges and moral obligations companies have? Please refer to the “common ethical issues in the context of the creation of value from data” (see slides week 11) in your answer.

• Can you think of ways companies can overcome, or at least mitigate, the issues described above?